These notes are based on Chapter 7.4-7.6 of Kroese, D. P., & Chan, J. C. (2014). Statistical modeling and computation. New York: Springer.

Introduction

In principle, the acceptance-rejection method can work in a high-dimensional problem if we can find a PDF \(g\) from which we can efficiently sample and a constant \(C\) giving a tight envelope \(Cg(x)\). It turns out this is a big IF. In 1-D cases, by looking at the graph of the target \(f(x)\), we are often able to design a reasonable \(Cg(x)\). Even if the envelope is not very tight, we can wait for some time and obtain enough samples from \(\text{U}(0,Cg(x))\) that fall below \(f(x)\). This is unlikely to be the case in high-dimensional problems. Our visual intuition in a 4-D or higher space is poor (at least mine is). In addition, the rejection rate increases exponentially with dimension, and we may not be able to wait for hours or days.

Markov chain Monte Carlo (MCMC) is a general class of methods to sequentially generate samples from an approximate distribution that increasingly resembles a target distribution. A sequence of samples are generated by simulating a Markov chain, which is designed to have the limiting distribution equal to the target distribution. By being content with only approximate distributions, MCMC enables us to handle functions on high-dimensional spaces, which are common in modern complex problems.

Since Monte Carlo simply refers to stochastic simulation, we might just call it Markov chain simulation. But, MCMC has sort of become the standard name everybody using statistical computing understands, perhaps because the acronym is just catchy.

Markov chain

Definition and notations

A Markov chain is a sequence of random variables \(X_1,X_2\dots\) whose futures are conditionally independent of the past given the present. That is, for all \(i\geq0\), \[(X_{i+1}|X_i,X_{i-1},\dots) \sim (X_{i+1}|X_i) .\]

In an IID sequence (e.g., Bernoulli process), futures are independent even of the present.

If each \(X_i\) is a discrete random variable, \[P(X_{i+1}=x_{i+1}|X_i=x_i,X_{i-1}=x_{i-1},\dots) = P(X_{i+1}=x_{i+1}|X_i=x_i).\] In particular, if \(x_{i+1}\) is an element of a finite set that is common for all \(i\geq0\), then the Markov chain can be represented by the transition matrix, which we denote by \(Q\). That is, for all \(i\geq0\), \[Q(j,k) = P(X_{i+1}=k|X_i=j) = P(X_1=k|X_0=j),\] where \(j,k\) are elements of the finite set, which is usually called the state space. For instance, if we start with \(X_0=j\), the probability of \(X_2=k\) is \[\begin{align} P(X_2=k|X_0=j) &= \sum_{s} P(X_2=k,X_1=s|X_0=j)\\ &= \sum_{s} P(X_2=k|X_1=s)\cdot P(X_1=s|X_0=j)\\ &= \sum_{s} Q(j,s)\cdot Q(s,k)\\ &= Q^2(j,k), \end{align}\] where the sum runs over the state space.

As a concrete example, consider a Markov chain with state space \(\{1,2,3,4,5,6\}\) and the following transition matrix.
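For reference, the matrix \(Q\) printed below can be constructed in R as follows (a direct transcription of the printed values):

```r
# Transition matrix of the 6-state chain; each row sums to 1
Q <- matrix(c(0.0, 0.2, 0.0, 0.3, 0.5, 0.0,
              0.5, 0.0, 0.5, 0.0, 0.0, 0.0,
              0.3, 0.0, 0.0, 0.7, 0.0, 0.0,
              0.0, 0.0, 0.0, 0.0, 0.0, 1.0,
              0.0, 0.0, 0.0, 0.8, 0.0, 0.2,
              0.0, 0.0, 0.1, 0.0, 0.9, 0.0),
            nrow=6, byrow=TRUE)
```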

Q
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  0.0  0.2  0.0  0.3  0.5  0.0
## [2,]  0.5  0.0  0.5  0.0  0.0  0.0
## [3,]  0.3  0.0  0.0  0.7  0.0  0.0
## [4,]  0.0  0.0  0.0  0.0  0.0  1.0
## [5,]  0.0  0.0  0.0  0.8  0.0  0.2
## [6,]  0.0  0.0  0.1  0.0  0.9  0.0

For example, from \(Q^2\), you can read off the probability of \(X_2=5\) given \(X_0=3\) as 0.15.

Q%*%Q
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.10 0.00 0.10 0.40 0.00 0.40
## [2,] 0.15 0.10 0.00 0.50 0.25 0.00
## [3,] 0.00 0.06 0.00 0.09 0.15 0.70
## [4,] 0.00 0.00 0.10 0.00 0.90 0.00
## [5,] 0.00 0.00 0.02 0.00 0.18 0.80
## [6,] 0.03 0.00 0.00 0.79 0.00 0.18

In general, the probability of \(X_n=k\) given \(X_0=j\) is \(Q^n(j,k)\).
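As a sanity check (a sketch, not from the text), we can estimate \(Q^2(3,5)\) by simulating many two-step transitions from state 3:

```r
set.seed(7)
# Transition matrix as printed above
Q <- matrix(c(0.0, 0.2, 0.0, 0.3, 0.5, 0.0,
              0.5, 0.0, 0.5, 0.0, 0.0, 0.0,
              0.3, 0.0, 0.0, 0.7, 0.0, 0.0,
              0.0, 0.0, 0.0, 0.0, 0.0, 1.0,
              0.0, 0.0, 0.0, 0.8, 0.0, 0.2,
              0.0, 0.0, 0.1, 0.0, 0.9, 0.0),
            nrow=6, byrow=TRUE)
N <- 1e5
x2 <- replicate(N, {
  x1 <- sample(1:6, 1, prob=Q[3,])   # first step from state 3
  sample(1:6, 1, prob=Q[x1,])        # second step
})
mean(x2 == 5)  # approximately Q^2(3,5) = 0.15
```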

For most MCMC applications, we are interested in continuous random variables, i.e., the state space is uncountable. In this case, for \(A\subset\mathbb{R}^d\), \[P(X_{i+1}\in A|X_i=x_i,X_{i-1}=x_{i-1},\dots) = P(X_{i+1}\in A|X_i=x_i) .\]

In particular, we assume the conditional PDF \(f_{X_{i+1}|X_i}(y|x)\) does not depend on \(i\) and denote it by \(q\). That is, for all \(i\geq0\), \[q(y|x) = f_{X_{i+1}|X_i}(y|x)=f_{X_1|X_0}(y|x),\] where \(x,y\in\mathbb{R}^d\) are states. We call \(q\) the transition density of the (continuous-state) Markov chain.

A state distribution (i.e. marginal distribution) \(f\) is said to be stationary if \(f\) satisfies the following global balance equations: \[f(x_{i+1}) = \int f(x_{i})q(x_{i+1}|x_{i})dx_{i}.\]

It is a set of global balance equations, as it holds for all \(x_i\).

To see how special the global balance is, recall that we always have \[f_{i+1}(x_{i+1}) = \int f_{i}(x_{i})q(x_{i+1}|x_{i})dx_{i},\] by marginalisation, i.e. integrating out \(x_i\) from the joint PDF. In general, two marginal PDFs \(f_{i+1}\) and \(f_{i}\) need not be the same. However, if they happen to be identical under the transition \(q\), then \(f_{i+1} = f_{i} = f\) is a stationary distribution in the Markov chain characterised by \(q\).
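As a concrete continuous-state illustration (my own example, not from the chapter), the Gaussian AR(1) transition \(q(y|x) = \text{N}(y;\, \rho x,\, 1-\rho^2)\) has stationary distribution \(\text{N}(0,1)\); we can check the global balance numerically:

```r
# Numerically verify global balance f(y) = \int f(x) q(y|x) dx at a few y values
rho <- 0.7
f_stat <- function(x) dnorm(x)                                 # candidate stationary PDF
q_tr <- function(y, x) dnorm(y, mean=rho*x, sd=sqrt(1-rho^2))  # AR(1) transition density
for (y in c(-1, 0, 2)) {
  lhs <- f_stat(y)
  rhs <- integrate(function(x) f_stat(x)*q_tr(y, x), -Inf, Inf)$value
  cat("y =", y, ": f(y) =", lhs, ", integral =", rhs, "\n")
}
```

Each printed pair agrees, as expected from the fact that \(\rho X + \varepsilon \sim \text{N}(0,1)\) when \(X\sim\text{N}(0,1)\) and \(\varepsilon\sim\text{N}(0,1-\rho^2)\) independently.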

Why Markov chain?

Given initial value/state \(x_0\), using \(q\), we can simulate a Markov chain, i.e. sequentially obtain samples \(x_1,x_2,\dots,x_n\).

The rationale for Markov chain as a method for sampling from a desired target distribution \(f\) is:

  1. Under certain conditions, a Markov chain converges to a unique stationary distribution as \(n\to\infty\).
  2. We can design a Markov chain whose limiting distribution (hence, stationary distribution) has the target PDF \(f\).
  3. After running the Markov chain long enough \((i > I)\), we may expect the distribution of the chain to be almost stationary and the samples \(x_{I+1},x_{I+2},\dots,x_n\) with \(n\gg I\) to resemble IID samples from \(f\).

Samples from the first \(I\) steps (so-called “burn-in” period) are often discarded.

It is easy to approximate the limiting distribution for a finite-state Markov chain. For instance, for \(Q\) defined above, the stationary distribution \(p\) (as a row vector) satisfies \[pQ = p,\] i.e., \(p^T\) is an eigenvector of \(Q^T\) with eigenvalue \(1\).

p <- Re(eigen(t(Q))$vectors[,1]) # eigenvector for eigenvalue = 1
p <- p/sum(p) # normalisation
p
## [1] 0.011978917 0.002395783 0.035936751 0.283660757 0.318639195 0.347388596

We can see that \(Q^{500}\) has identical rows, each of which is almost equal to the stationary distribution.

library(expm) # for the matrix power operator %^%
n <- 500
Qn <- Q%^%n
Qn
##            [,1]        [,2]       [,3]      [,4]      [,5]      [,6]
## [1,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [2,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [3,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [4,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [5,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [6,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886

What does this imply? For example, suppose \(X_0=4\) and \(p_0 = (0,0,0,1,0,0)\). Then,

p0 <- c(0,0,0,1,0,0)
p0%*%Qn
##            [,1]        [,2]       [,3]      [,4]      [,5]      [,6]
## [1,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886

In general,

p0 <- runif(6)
p0 <- p0/sum(p0) # random initial state distribution
p0%*%Qn
##            [,1]        [,2]       [,3]      [,4]      [,5]      [,6]
## [1,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886

No matter which state we are in at the beginning, we will end up with the same state distribution at time 500 onwards.

Relatedly, we can estimate this limiting distribution by MCMC:

n <- 1e6
x <- rep(0, n)
x[1] <- sample(1:6, 1) # random initial state
for (i in 2:n) {
  x[i] <- sample(1:6, 1, prob=Q[x[i-1],]) # MCMC sampling
}
table(x[501:n])/(n-500) # 500 burn-in periods
## 
##           1           2           3           4           5           6 
## 0.011924962 0.002450225 0.035855928 0.283771886 0.318661331 0.347335668

In contrast, if we sample from the true stationary distribution, an MC estimate is

x <- sample(1:6, n-500, replace=T, prob=p)
table(x)/(n-500)
## x
##           1           2           3           4           5           6 
## 0.012010005 0.002455228 0.035841921 0.283684842 0.318441221 0.347566783

Design principles

The task in MCMC applications is to design a good Markov chain so that the chain converges to the desired target distribution. The following are key design principles.

Ergodicity is the condition for a Markov chain to converge to a unique stationary distribution irrespective of \(x_0\). It consists of two intuitive properties: irreducibility and aperiodicity.

Irreducibility ensures that every state \(x\) can be visited from every other state in a finite number of steps. This is important because, in MCMC, the sequence of samples is generated through dependent sampling according to the transition density \(q\), and some \(x_{i+1}\) may not be reachable directly from the current \(x_i\), i.e., \(q(x_{i+1}|x_i)=0\).

Aperiodicity prevents indefinite oscillation and allows the chain to converge to a stationary distribution.

See Roberts & Rosenthal (2004) for more details.

Among the ergodic Markov chains, we could in principle choose a transition density \(q\) that satisfies the global balance equations in order to ensure that the chain converges to our target \(f\). In practice, however, this is tricky and challenging because of the integral in the definition of the global balance equations. So, we typically impose a stronger condition on \(q\) called the detailed balance equations: \[f(x)q(y|x) = f(y)q(x|y), \;\forall x,y.\]

By integrating both sides over \(x\) (or \(y\)), we see the detailed balance equations imply the global balance equations.
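Explicitly, integrating the detailed balance equations over \(x\) gives

\[\int f(x)q(y|x)\,dx = \int f(y)q(x|y)\,dx = f(y)\int q(x|y)\,dx = f(y),\]

which is exactly the global balance equation with \(x_i = x\) and \(x_{i+1} = y\).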

Demos

Here is an example of MCMC for sampling from a bivariate normal distribution \(\text{N}(\mu,\Sigma)\) with \[ \mu=0 \quad\text{and}\quad \Sigma=\begin{bmatrix}1 & 0.7\\0.7 & 1\\\end{bmatrix} .\]

Recall what this joint PDF looks like: a bell shape elongated along the \(x=y\) diagonal because of the 0.7 correlation (figure omitted).

Using Gibbs sampler, a result may look like the following.

set.seed(123)
n <- 300
m <- 3   # number of most recent moves drawn as line segments in the plot
r <- 0.7 # target correlation, used by the conditional distributions below

# Initialisation
x <- -1
xx <- rep(x, n)
y <- 2
yy <- rep(y, n)

# Run
for (i in 2:n) {
  x <- rnorm(1, mean=r*y, sd=sqrt(1-r^2))
  xx[i] <- x
  y <- rnorm(1, mean=r*x, sd=sqrt(1-r^2))
  yy[i] <- y
  par(mar=c(5,5,.1,.1))
  plot(xx[1], yy[1], xlab="x", ylab="y", xlim=c(-3,3), ylim=c(-3,3), pch=4)
  if (i <= m) {
    segments(xx[-i],yy[-i],xx[-1],yy[-1],lwd=.5)
  } else {
    points(xx[2:(i-(m-1))], yy[2:(i-(m-1))], pch=20, cex=.4)
    segments(xx[(i-m):(i-2)], yy[(i-m):(i-2)],
             xx[(i-(m-1)):(i-1)], yy[(i-(m-1)):(i-1)], lwd=.5)
  }
  arrows(xx[i-1],yy[i-1],xx[i],yy[i], length=0.08, angle=20, lwd=1)
}

Pay attention to the dependency of each new point on the previous one; each arrow goes only so far. No single point is an IID draw from the target distribution. But, collectively, these points are distributed as if each were sampled from the target distribution.
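Dropping the plotting and running longer (a sketch using the same conditionals), we can check that the Gibbs samples reproduce the target correlation of 0.7:

```r
# Gibbs sampler for the bivariate normal, run long enough to estimate cor(X, Y)
set.seed(1)
r <- 0.7
n <- 1e5
xx <- numeric(n); yy <- numeric(n)
x <- 0; y <- 0
for (i in 1:n) {
  x <- rnorm(1, mean=r*y, sd=sqrt(1-r^2))  # sample X | Y = y
  y <- rnorm(1, mean=r*x, sd=sqrt(1-r^2))  # sample Y | X = x
  xx[i] <- x; yy[i] <- y
}
cor(xx, yy)  # close to 0.7
```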

Another example, consisting of two clusters, gives us an even clearer picture of the dependency. The two clusters are \(\text{N}(\mu_0,\Sigma)\) and \(\text{N}(\mu_1,\Sigma)\), where \[ \mu_0=\begin{bmatrix}-1.7\\-1.7\end{bmatrix},\; \mu_1=\begin{bmatrix}1.7\\1.7\end{bmatrix}, \;\text{and}\quad \Sigma=\begin{bmatrix}1 & 0\\0 & 1\\\end{bmatrix},\] and the cluster assignment \(C\) follows \(\text{Bernoulli}(0.5)\). In other words, writing \(x=(x_1,x_2)^T\), the joint PDF is \[f(x_1,x_2,c) = \frac{1}{2}\cdot\frac{1}{2\pi}\exp\left(-\frac{1}{2}(x-\mu_c)^T(x-\mu_c)\right).\]

set.seed(123)
n <- 300
m <- 3
burn_in <- 5000

# Initialisation; C (the two cluster means along each axis) and f (the
# unnormalised joint PDF) are assumed definitions, not shown in the original
C <- c(-1.7, 1.7)
f <- function(x, y, c) dnorm(x, mean=c) * dnorm(y, mean=c)
x <- 0
y <- 0
c <- C[1]

# Run for the burn-in
for (i in 2:burn_in) {
  x <- rnorm(1, c)
  y <- rnorm(1, c)
  k <- f(x,y,C[1]) + f(x,y,C[2])
  c <- sample(c(C[1],C[2]), size=1, prob=c(f(x,y,C[1])/k, f(x,y,C[2])/k))
}

# Run for the plot
xx <- rep(x,n)
yy <- rep(y,n)
cc <- rep(c,n)
for (i in 2:n) {
  x <- rnorm(1, c)
  xx[i] <- x
  y <- rnorm(1, c)
  yy[i] <- y
  k <- f(x,y,C[1]) + f(x,y,C[2])
  c <- sample(c(C[1],C[2]), size=1, prob=c(f(x,y,C[1])/k, f(x,y,C[2])/k))
  cc[i] <- c

  par(mar=c(5,5,.1,.1))
  plot(xx[1], yy[1], xlab="x", ylab="y", xlim=c(-4,4), ylim=c(-4,4),
       pch=20, cex=.4, col=cc[1]/C[2]+3)
  if (i <= m) {
    segments(xx[-i],yy[-i],xx[-1],yy[-1],lwd=.5)
  } else {
    points(xx[2:(i-(m-1))], yy[2:(i-(m-1))], pch=20, cex=.4, col=cc[2:(i-(m-1))]/C[2]+3)
    segments(xx[(i-m):(i-2)], yy[(i-m):(i-2)],
             xx[(i-(m-1)):(i-1)], yy[(i-(m-1)):(i-1)], lwd=.5)
  }
  arrows(xx[i-1],yy[i-1],xx[i],yy[i], length=0.08, angle=20, lwd=1)
}

As you can see, most samples are followed by samples in the same cluster. If samples were independent, the arrow would jump between the clusters almost every other time.

Since convergence requires more time in this example than in the first, the earliest 5000 samples are discarded as a burn-in period.

A takeaway message at this point is that, by running MCMC for a long time, we can collect samples as if they were drawn from the target PDF. Since MC estimates often require many samples anyway, running for a long time may not seem like an issue. However, the run may be longer than necessary, collecting far more samples than required for good MC estimation. In practice, we prefer 1,000 good samples to 1,000,000 when 1,000 are good enough.

Metropolis-Hastings algorithm

In practice, most of us do not design a transition density \(q\) from scratch; rather, we use proven design templates. The Metropolis-Hastings algorithm provides a good template. Suppose we have an ergodic Markov chain characterised by a transition density \(q\). The Metropolis-Hastings algorithm tells us how to build upon this Markov chain and design another one that satisfies the detailed balance equations with respect to the target \(f\).

The ergodicity of the new chain is implied by the ergodicity of the original Markov chain with \(q\).

The idea is similar to the acceptance-rejection method.

  1. Propose \(y\) drawn from \(q(\cdot|x_i)\) as the next sample.
  2. The proposal is accepted with probability \(\alpha(x_i,y)\), in which case \(x_{i+1}=y\); otherwise it is rejected and \(x_{i+1}=x_i\).

The acceptance probability is defined as follows: \[\alpha(x,y) = \min\left(1, \frac{f(y)q(x|y)}{f(x)q(y|x)}\right) .\]

An intuition is the following. If we have a very good proposal density \(q\), which is very similar to the target function \(f\), we will have both ratios \(f(y)/q(y|x)\) and \(f(x)/q(x|y)\) close to 1, leading to \(\alpha\) close to 1. Hence, most of the time, we accept proposals.

The new transition density for \(x_{i+1}\neq x_i\) is \[\tilde{q}(x_{i+1}|x_i) = q(x_{i+1}|x_i)\alpha(x_i,x_{i+1}) .\] Less importantly, the probability of not moving is \[P(X_{i+1}=x_i|X_i=x_i) = 1 - \int_{y\neq x_i} q(y|x_i)\alpha(x_i,y)dy .\]

Let’s make sure that \(\tilde{q}\) satisfies the detailed balance equations: \[f(x)\tilde{q}(y|x) = f(y)\tilde{q}(x|y), \;\forall x,y.\] First, notice that, regardless of \(\tilde{q}\), it trivially holds for \(y=x\). This is why \(X_{i+1}=x_i\) is unimportant. Next, for \(y\neq x\), \[\begin{align} f(x)\tilde{q}(y|x) &= f(x)q(y|x)\alpha(x,y)\\ &= f(x)q(y|x)\min\left(1, \frac{f(y)q(x|y)}{f(x)q(y|x)}\right)\\ &= \min\left(f(x)q(y|x),\; f(y)q(x|y)\right)\\ &= \min\left(f(y)q(x|y),\; f(x)q(y|x)\right)\\ &= f(y)\tilde{q}(x|y). \end{align}\]
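To see the algorithm in code, here is a minimal random-walk Metropolis sketch for a standard normal target (my own illustration, not from the text; the symmetric proposal makes \(q(x|y)=q(y|x)\) cancel in \(\alpha\)):

```r
# Random-walk Metropolis targeting an unnormalised N(0,1) density
set.seed(42)
target <- function(x) exp(-x^2/2)   # knowing f only up to a constant is enough
n <- 1e5
x <- numeric(n)
for (i in 2:n) {
  y <- x[i-1] + rnorm(1)            # symmetric proposal: q(y|x) = N(y; x, 1)
  alpha <- min(1, target(y)/target(x[i-1]))
  x[i] <- if (runif(1) < alpha) y else x[i-1]  # accept, or stay put
}
c(mean(x), sd(x))  # approximately 0 and 1
```

Note that \(\alpha\) only involves the ratio \(f(y)/f(x)\), so the normalising constant of the target never needs to be known.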

Gibbs sampler

Gibbs sampling (a.k.a. alternating conditional sampling) is a special case of the Metropolis-Hastings algorithm and widely used to address multidimensional problems. It is particularly suitable for Bayesian hierarchical models due to the modularised model construction. Suppose \(x\) is a vector in a multidimensional space and divided into \(d\) sub-vectors \(x=(x^{(1)},\dots,x^{(d)})\). (n.b. Each sub-vector need not be one-dimensional.) Let \(f^{(k)}\) be the conditional PDF of \(x^{(k)}\) \((k\in\{1,\dots,d\})\) given the values of the other sub-vectors: \[f^{(k)}(x^{(k)}|x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)}) .\] Gibbs sampler assumes we can sample from each of these conditional PDFs, and proceeds as follows.

  1. Sample \(x_{i+1}^{(1)}\) from \(f^{(1)}(x^{(1)}|x_i^{(2)},\dots,x_i^{(d)})\).
  2. Sample \(x_{i+1}^{(2)}\) from \(f^{(2)}(x^{(2)}|x_{i+1}^{(1)}, x_i^{(3)},\dots,x_i^{(d)})\).
  3. Sample \(x_{i+1}^{(3)}\) from \(f^{(3)}(x^{(3)}|x_{i+1}^{(1)},x_{i+1}^{(2)},x_i^{(4)},\dots,x_i^{(d)})\).
    \(\vdots\)
  d. Sample \(x_{i+1}^{(d)}\) from \(f^{(d)}(x^{(d)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(d-1)})\).

You get the idea. \(x_{i+1} = (x_{i+1}^{(1)},\dots,x_{i+1}^{(d)})\) is technically a proposal in the context of the Metropolis-Hastings algorithm, but \(x_{i+1}\) is always accepted in Gibbs sampler.

Go back to the demos above and see whether the code makes sense.

Stationarity and ergodicity

It is straightforward to recognise \(f\) as stationary under Gibbs sampling. Consider sampling \(x^{(k)}\) using \[f^{(k)}(x^{(k)}|x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)}) .\] First, the marginal \(f^{(-k)}(x^{(-k)})\) where \(x^{(-k)} = (x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)})\) remains invariant because no sampling occurs for \(x^{(-k)}\). Next, by definition, Gibbs sampler draws a sample of \(x^{(k)}\) from the correct conditional distribution \(f^{(k)}(x^{(k)}|x^{(-k)})\). Thus, the joint PDF \[f(x) = f^{(k)}(x^{(k)}|x^{(-k)})\cdot f^{(-k)}(x^{(-k)})\] remains invariant under Gibbs sampling.

How about ergodicity? Here is a sufficient condition for irreducibility that often holds in applications of the Gibbs sampler. Let \(\Omega\) be the support of the joint distribution \(f(x)\). If the support of \(f^{(k)}(x^{(k)}|x^{(-k)})\) coincides with the slice of \(\Omega\) at an arbitrary \(x^{(-k)}\), then every \(x^{(k)}\) can be reached from any \(x\). This is clearly the case in the above demos, where the support of each conditional normal distribution is the entire real line. As for aperiodicity, at this level, you may take it for granted in almost any non-trivial application.

Special case of Metropolis-Hastings

Looking at a sample \(x_{i+1}^{(k)}\) drawn from \[f^{(k)}(x_{i+1}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)}),\] we may put \(x_{i+1}^{(k)}\) in a full-length vector \(x\) and index it as \(j+1\). That is, \[\begin{align} \vdots&\\ x_{j} &= (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ x_{j+1} &= (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i+1}^{(k)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ \vdots& \end{align}\]

Note that the transition densities of \(x_j\to x_{j+1}\) and \(x_{j+1}\to x_j\) are both given by the conditional PDF for \(x^{(k)}\), \[\begin{align} q(x_{j+1}|x_j) &= f^{(k)}(x_{i+1}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ &= f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})\\ q(x_j|x_{j+1}) &= f^{(k)}(x_{i}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ &= f^{(k)}(x_{i}^{(k)}|x^{(-k)}), \end{align}\] where we use \(x^{(-k)} = x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)}\). Now, it follows that \[\begin{align} \alpha(x_{j},x_{j+1}) &= \min\left(1,\; \frac{f(x_{j+1})\cdot q(x_{j}|x_{j+1})}{f(x_{j})\cdot q(x_{j+1}|x_{j})}\right)\\ &= \min\left(1,\; \frac{f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})\cdot f^{(-k)}(x^{(-k)})\cdot f^{(k)}(x_{i}^{(k)}|x^{(-k)})}{f^{(k)}(x_{i}^{(k)}|x^{(-k)})\cdot f^{(-k)}(x^{(-k)})\cdot f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})}\right)\\ &= \min(1, 1)\\ &= 1 \end{align}\] Hence, a proposal is always accepted.

Example

Imagine you try to model the number of customers at each of two bars in Melbourne using Poisson distributions, namely \(\text{pois}(\lambda_1)\) for Bar 1 and \(\text{pois}(\lambda_2)\) for Bar 2. Based on other information, you think \(\lambda_1\) and \(\lambda_2\) are distinct but somehow related. So, you assume both \(\lambda_1\) and \(\lambda_2\) to be distributed as \(\text{exp}(\theta)\). Finally, you complete the model by simply assuming \(\theta \sim \text{exp}(1)\).

Suppose you are going to collect \(m_1\) samples from Bar 1 and \(m_2\) samples from Bar 2. Then, your model is:

\[\begin{gather} X_1, X_2, \dots, X_{m_1} \sim \text{pois}(\lambda_1)\\ X_{m_1+1}, X_{m_1+2}, \dots, X_{m_1+m_2} \sim \text{pois}(\lambda_2)\\ \lambda_1, \lambda_2 \sim \text{exp}(\theta)\\ \theta \sim \text{exp}(1) \end{gather}\] (As usual, the observations are independent. In addition, \(\lambda_1\) and \(\lambda_2\) are independent given \(\theta\).)

You naturally expect to learn something about the average numbers of customers at Bar 1 and Bar 2 (\(\lambda_1\) and \(\lambda_2\)) and how they are related.

Given the collected data \[\mathbf{x}=(\mathbf{x}_1,\mathbf{x}_2)=(x_1,\dots,x_{m_1},x_{m_1+1},\dots,x_{m_1+m_2}),\] you derive \(f(\lambda_1, \lambda_2, \theta|\mathbf{x})\), a joint PDF of \((\lambda_1, \lambda_2, \theta)\) conditional on \(\mathbf{x}\).

\[\begin{align} f(\lambda_1,\lambda_2,\theta|\mathbf{x}) &= \frac{f(\mathbf{x},\lambda_1,\lambda_2,\theta)}{f(\mathbf{x})}\\ &\propto f(\mathbf{x},\lambda_1,\lambda_2,\theta)\\ &= f(\mathbf{x}|\lambda_1,\lambda_2,\theta)\cdot f(\lambda_1,\lambda_2,\theta)\\ &= f((\mathbf{x}_1,\mathbf{x}_2)|\lambda_1,\lambda_2)\cdot f(\lambda_1,\lambda_2|\theta)\cdot f(\theta)\\ &= f(\mathbf{x}_1|\lambda_1)\cdot f(\mathbf{x}_2|\lambda_2)\cdot f(\lambda_1|\theta)\cdot f(\lambda_2|\theta)\cdot f(\theta)\\ &= \prod_{i=1}^{m_1}f(x_i|\lambda_1) \cdot \prod_{j=m_1+1}^{m_1+m_2}f(x_j|\lambda_2) \cdot f(\lambda_1|\theta) \cdot f(\lambda_2|\theta) \cdot f(\theta)\\ &= \prod_{i=1}^{m_1}\frac{e^{-\lambda_1}\lambda_1^{x_i}}{x_i!}\cdot\prod_{j=m_1+1}^{m_1+m_2}\frac{e^{-\lambda_2}\lambda_2^{x_j}}{x_j!}\cdot\theta e^{-\theta\lambda_1}\cdot\theta e^{-\theta\lambda_2}\cdot e^{-\theta}\\ &\propto e^{-m_1\lambda_1}\lambda_1^{\sum_{i=1}^{m_1}x_i}\cdot e^{-m_2\lambda_2}\lambda_2^{\sum_{j=m_1+1}^{m_1+m_2}x_j}\cdot\theta e^{-\theta\lambda_1}\cdot\theta e^{-\theta\lambda_2}\cdot e^{-\theta}\\ &= \theta^2 \cdot e^{-\theta} \cdot e^{-(\theta+m_1)\lambda_1}\cdot e^{-(\theta+m_2)\lambda_2} \cdot \lambda_1^{\sum_{i=1}^{m_1}x_i} \cdot \lambda_2^{\sum_{j=m_1+1}^{m_1+m_2}x_j} \end{align}\]

\(f(\lambda_1, \lambda_2, \theta|\mathbf{x})\) does not look like any familiar PDF.

Let’s implement a Gibbs sampler to sample from this joint PDF. First, you need to identify the following full conditionals from the joint PDF above, by organising terms and recalling that a density proportional to \(\lambda^{a-1}e^{-b\lambda}\) is the \(\text{Gamma}(a,b)\) PDF with shape \(a\) and rate \(b\).

\[\begin{align} f(\lambda_1|\lambda_2,\theta,\mathbf{x}) &\propto \lambda_1^{\sum_{i=1}^{m_1}x_i} e^{-(\theta+m_1)\lambda_1} &&\Rightarrow\; \lambda_1|\lambda_2,\theta,\mathbf{x} \sim \text{Gamma}\left(1+\sum_{i=1}^{m_1}x_i,\; \theta+m_1\right)\\ f(\lambda_2|\lambda_1,\theta,\mathbf{x}) &\propto \lambda_2^{\sum_{j=m_1+1}^{m_1+m_2}x_j} e^{-(\theta+m_2)\lambda_2} &&\Rightarrow\; \lambda_2|\lambda_1,\theta,\mathbf{x} \sim \text{Gamma}\left(1+\sum_{j=m_1+1}^{m_1+m_2}x_j,\; \theta+m_2\right)\\ f(\theta|\lambda_1,\lambda_2,\mathbf{x}) &\propto \theta^2 e^{-(1+\lambda_1+\lambda_2)\theta} &&\Rightarrow\; \theta|\lambda_1,\lambda_2,\mathbf{x} \sim \text{Gamma}(3,\; 1+\lambda_1+\lambda_2). \end{align}\]

Then, you cycle through them for \(n\) rounds of sampling and return an \(n\times 3\) matrix of draws of \((\lambda_1,\lambda_2,\theta)\).

sampler <- function(n, x1, x2) {
  set.seed(90045)
  
  m1 <- length(x1)
  m2 <- length(x2)
  s1 <- sum(x1)
  s2 <- sum(x2)
  D <- matrix(0, n, 3) # columns: lambda1, lambda2, theta; row 1 is the initial state
  for (i in 2:n) {
    D[i,1] <- rgamma(1, 1+s1, m1+D[i-1,3])  # lambda1 | theta, x
    D[i,2] <- rgamma(1, 1+s2, m2+D[i-1,3])  # lambda2 | theta, x
    D[i,3] <- rgamma(1, 3, 1+D[i,1]+D[i,2]) # theta | lambda1, lambda2
  }
  
  return(D)
}

For example,

x1 <- c(10,9,6,19,8,13,15,7,11)
x2 <- c(14,10,11,9,9,6,13,19,10,8,6,13,8,15,21,7,12,11)
n <- 1e6
D <- sampler(n, x1, x2)
burn_in <- floor(0.05*n) # throwing away 1st 5% of samples
D <- D[(burn_in+1):n,]

Using the samples D, you can approximate many quantities. For example,

  • the mean of \(\lambda_1\),
  • the mean of \(\lambda_2\), and
  • the correlation coefficient between \(\lambda_1\) and \(\lambda_2\).

\[\begin{align} \mu_1 &= \mathbb{E}[\lambda_1|\mathbf{x}]\\ \mu_2 &= \mathbb{E}[\lambda_2|\mathbf{x}]\\ \sigma^2_1 &= \mathbb{E}[(\lambda_1-\mu_1)^2|\mathbf{x}]\\ \sigma^2_2 &= \mathbb{E}[(\lambda_2-\mu_2)^2|\mathbf{x}]\\ \text{Cov}(\lambda_1,\lambda_2) &= \mathbb{E}[(\lambda_1-\mu_1)(\lambda_2-\mu_2)|\mathbf{x}]\\ \rho &= \frac{\text{Cov}(\lambda_1,\lambda_2)}{\sigma_1\sigma_2} \end{align}\]

mu1 <- mean(D[,1])
mu2 <- mean(D[,2])
sig1 <- mean((D[,1]-mu1)^2)
sig2 <- mean((D[,2]-mu2)^2)
cov <- mean((D[,1]-mu1)*(D[,2]-mu2))
rho <- cov/sqrt(sig1*sig2)
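Beyond moments, the posterior draws also give interval summaries. A sketch (using stand-in Gamma draws in place of the Gibbs output column `D[,1]`, so the snippet runs on its own; the shape and rate come from the full conditional of \(\lambda_1\) at \(\theta=1\)):

```r
# 95% posterior credible interval for lambda_1
set.seed(1)
# Stand-in draws from Gamma(1 + sum(x1), m1 + theta) with theta fixed at 1;
# with the actual Gibbs output, use draws <- D[,1] instead
draws <- rgamma(1e5, shape=1+98, rate=9+1)  # sum(x1) = 98, m1 = 9
quantile(draws, c(0.025, 0.975))
```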

Results are \[\hat{\mu}_1 = 10.84,\; \hat{\mu}_2 = 11.2,\; \hat{\rho} = 0.0047.\] It seems the two bars have very similar average numbers of customers. As far as the correlation is concerned, however, they appear unrelated.

References

  • Sec 11.2. Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
  • Brooks, S., Gelman, A., Jones, G., & Meng, X. L. (Eds.). (2011). Handbook of Markov chain Monte Carlo. CRC press.
  • Ch 11. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC press.
  • Sec 6.5 & 6.14. Grimmett, G. and Stirzaker, D. (2001). Probability and Random Processes. Oxford University Press.
  • Roberts, G. O., & Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probability surveys, 1, 20-71.